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Abstract 

In this paper we introduce a novel method to conduct inference with mod- 
els defined through a continuous-time Markov process, and we apply these 
results to a classical stochastic SIR model as a case study. Using the inverse- 
size expansion of van Kampen we obtain approximations for first and second 
moments for the state variables. These approximate moments are in turn 
matched to the moments of an inputed generic discrete distribution aimed 
at generating an approximate likelihood that is valid both for low count or 
high count data. We conduct a full Bayesian inference to estimate epidemic 
parameters using informative priors. Excellent estimations and predictions 
are obtained both in a synthetic data scenario and in two Dengue fever case 
studies. 

Keytuords: Surrogate model. Bayesian inference. Chemical master equation 



1. Introduction 

Stochasticity and nonlinearity have a major role in shaping the dynamics 
of epidemics of infectious diseases. It is known that the effects of demo- 
graphic stochasticity weight more in determining the dynamics of epidemics 
when the number of individuals in the population is low. Consequently, 
the development of mathematical epidemic models that take into account 
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uncertainty and are amenable to describing small populations is an active 
field of research, see [H El El HI El E] . Among these modeling efforts, mod- 
els based on Markov processes have received a lot of attention. It has been 
argued that continuous-time discrete-space (CTDS) Markov processes whose 
forward-Kolmogorov equation is known as the chemical master equation 
(CME) represent the stochastic counterpart to systems of ordinary differ- 
ential equations and are often referred to as birth-and-death processes. 
The research presented in this paper is based on this modeling paradigm. 
We consider the stochastic SIR epidemic model study. The SIR 

model and its variants are ubiquitous in the study of dynamics of epidemics 
in populations of plants and animals [HI E] • Seasonality, migration, vaccina- 
tion, demographic and environmental stochasticity are among the epidemics 
features that give rise to variants of the SIR model. There is a wealth of 
qualitative results that have been obtained from the analysis of these mod- 
els, e.g. thresholds for the onset of epidemics, amplification of environmental 
and demographic noise, mechanisms for local extinction and invasion of epi- 
demics. Feedback from the bifurcation analysis of deterministic SIR models 
flUl [TTl [T2| [T3] and the analysis of the interplay of stochasticity and nonlin- 
earity [HI [15] has provided substantial insight regarding epidemic dynamics. 
A related topic is the quantitative study of the predictive capacity of epi- 
demic models. Recently, parameter estimation of epidemic models, from a 
time series analysis perspective, and with varying degrees of formality have 
been attempted [161 HH UHl [THl EO] . 

Recently, various Bayesian and likelihood based approaches to parameter 
inference in CME type models have been attempted. Some authors have tried 
to work with a deterministic, continuous or steady state alternative model to 
infer the parameters in the CME [211 [22], but with modest success; specially 
in the important scenario of low number of counts for the species, when 
the stochastic evolution of the system is in fact the main object of interest 
|23j . In particular, previous research efforts to conduct Bayesian inference 
with the CME include [211 [22], where the CME is approximated by a diffusion 
process and [25] where the van Kampen expansion is used to derive a diffusion 
approximation; other similar papers include [261 12Z] • However, this approach 
applies when individual counts are large and, in the limit, fluctuations are 
small and in turn approximated with a Gaussian diffusion ^61 p. 246] which, 
certainly, will be of limited use if the low counts for some or all species are 
an issue, as it is the case with epidemic studies. 

In the chemical master equation scenario simulating data is possible (see 
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Section [2]) and the likelihood free Approximate Bayesian Computational 
(ABC) inference ideas may be used [28]. The ABC simulates data for fixed 
parameters and rejects trajectories that are not "close" (according to an ad 
hoc metric among an ad hoc statistic) to available data. However, as of yet 
the ABC lacks of a firm theoretical support and many crucial details remain 
unknown, namely, what regularity conditions are required (if any) for the 
ABC to work, how to choose the tolerance and metric and the consequences 
of not using a sufficient statistic [28]. Moreover, when the number of reac- 
tions per unit of time increases the simulation algorithm may become very 
inefficient. [22] represents an attempt to use the ABC in this context, for 
model inference and model selection; however a very recent result [30] will 
severely question the validity of the latter. Other Bayesian inference ap- 
proaches for the CME are [311 132] and will be discussed in further detail in 
Section HI 

In this paper we describe a method to conduct parameter estimation in 
the CME from partial observation of the state variables. We have used the 
van Kampen expansion to obtain approximate equations of the dynamics 
of the first two moments of the state variables of the CME to impute a 
counting model matching these moments to create a likelihood for Bayesian 
inference. We remark that our results hold for low populations. For the 
sake of clarity, our examples use the simplest epidemic model, e.g. the SIR 
model without vital dynamics. It is shown that inference is possible with 
the approximate likelihood using synthetic data generated with the Gillespie 
algorithm and the stochastic SIR model. We offer results with data from 
Dengue Fever onset from two cities central Mexico, Acapulco in 2005 and 
Cuernavaca in 2008. Our contention is that the results presented in this 
paper hold for models formulated as a CME under the hypothesis of spatial 
homogeneity and thermodynamic equilibrium, where only mono-molecular 
and bimolecular reactions occur. 

The paper is organized as follows. In the next section we explain the 
SIR model and the moment approximations that we have used, in Section [3] 
we explain the details of the Bayesian inference and in Sections |4] and [5] we 
present examples with synthetic and real data, respectively. In Section |6] a 
discussion of the paper is presented. 
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2. The Stochastic SIR model and moment approximations 

Gardiner [33] claims that it is appropriate to describe the dynamics of 
intrinsically discrete systems such as epidemics in terms of jumps. Also, 
establishes that the chemical master equation (CME) offers a complete de- 
scription of such systems since the CME embodies macroscopic deterministic 
laws of motion about which of the stochastic nature of the system generates 
a random part. However, the chemical master equation accounts only for lo- 
cal (demographic) sources of stochasticity. Presumably, a complete model of 
an ecological system such as epidemics should also account for global (envi- 
ronmental) sources of stochasticity such as weather. Nevertheless, a general 
framework to derive stochastic models that takes into account both sources 
of stochasticity is not available. Although it is possible to couple the chem- 
ical master equations with models of extrinsic noise [H E], in this paper we 
restrict ourselves to the CME of the classical SIR epidemic model to conduct 
Bayesian inference. 

We present briefly the CME as in [32]. For u 'species' X = (Xi, . . . ,X„) 
and V 'reactions' it is assumed that for a small enough time interval At only 
one of the following v reactions takes place 

Rk '■ Pkl^l + • • • + Pku^u — ^ Qkl^l + . . . + Qku^u 

where k = 1,2, . . . ,v and pkj is the 'stoichiometry' of reactant j in reaction 
k and qkj is the 'stoichiometry' of product j in reaction k, see [32] for more 
details. This means that if reaction k takes place Xj changes by qtj — Pkj- 
Each reaction Rk has an associated reaction rate hk(X.,6k) (we assume here 
that 6k is a real positive parameter). Once a reaction has occurred (or at 
t = 0) the time to the next reaction has an exponential distribution with 
rate ho(X.,9k) = Yll=i^k(X.,6k), and the kth reaction occurs with prob- 
ability hk(X.,9k)/ho(X.,0k)- This constitutes a pure Markov jump process 
(continuous time) and since only the exponential and a discrete distribution 
is involved it is straightforward to simulate from; this is the so called Gille- 
spie simulation method [51]. The probability law governing the behavior of 
this Markov process is the CME. This model has been applied in a diversity 
of fields, e.g. biochemical reactions [35J, viral infections [36] and ecology |37] . 
Although in the following we concentrate on a much simpler setting (the 
SIR model) we keep this general model in perspective. In fact, our inference 
and prediction procedure may be considered in this general setting as far as 
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the rates hk(K,6i^) are polynomial, i.e. Kurtz theorem for the Fokker-Plank 
approximation holds, see Gardiner [33] . 

2.1. The SIR model 

As mentioned in Section [T| here we will work on a simple epidemic of 
the SIR type (susceptible, infectious, recovered). The stochastic approach 
to modeling epidemics has a long history and dates back to the pioneering 
work of Kermack and McKendrick [381 EH HQ]- The main hypothesis of the 
model is that contact rates occur according to a mass-action law and there is 
no demographic dynamics, implying that the time scale of the model is the 
length of a single epidemic event. The classical approach has been particu- 
larly valuable when small population sizes do not support the assumptions 
of the deterministic ODE models. The reader is referred to Daley and Gani 
[HI for a classical and standard derivation of the basic SIR epidemic model. 

Let the random variables Xi, X2 and X3 denote respectively the ( ' 
'species") number of susceptible, infected and recovered individuals in a 
closed population. The stochastic model is defined by two possible "reac- 
tions": infection and removal, which occur with "propensities" (rates) hi 
and /12 respectively: 

i?i : Xi + X2 ^ 2X2 infection 
i?2 • ^2 X3 removal. 

Here we have v = 3 species and u = 2 reactions. The reaction rates are given 
by /ii(X2,6o) = ^0-^2/^ and h2 = 61, for some positive parameters bQ,bi and 
Q to be explained below. 

If we denote by x = (xi,X2,X3) a realization of the random variables 
X = (Xi,X2,X3), and let P(xi,x2,x3)(t) be the probability that the system is 
in state x = (xi, X2, 0:3) at time t, then the chemical master equation for this 
system is given by 

dP(xuX2,X3){t) _ ^ (a^2 - 1) ^ .,^p 

- Oo ^ [Xl + i-)r^{xi+l,X2-l,X3)[t) 

+ hi {X2 + l)P{xi+l,X2-l,xz-l) (t) (1) 

- (^o|^a;i + biX2)P{xi,x2,x3){t). 
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Let 



where yi y Zi, i = 1,2,3 are respectively realizations of the means Yi and 
the fluctuations Zi of the state variables X^. Applying the van Kampen's 
inverse size expansion [33] to equation ([T]) leads to (deterministic) differential 
equations for the dynamics of the means 

2/1 = 1^2/1 (2) 

2/2 = ^0 1^2/1 - (3) 
2/3 = &12/2- (4) 

Equations ([2])-(|4]) represent the macroscopic limit of the CME, and coin- 
cide with the classical deterministic SIR model. From the van Kampen's 
expansion we obtain also a linear Fokker-Plank partial differential equation 
governing the dynamics of the fluctuations of the state variables 



i,3 *J 



where 



-&02/2 -hoy I 
A = I -&02/2 -&02/1 - &i 
bi 




&02/12/2 -^02/12/2 
B = I -6o2/i2/2 &02/12/2 + &i2/2 

-6i?/2 

and = "^{zi, Z2, Z3,t) is the probability that the fluctuations are in state 
(zi, Z2, Z3) at time t, see |l2l [5]. From equation (|5]) it can be readily estab- 
lished that the first and second moments of the fluctuations obey the 
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following differential equations 



dE[zk] 



3 



dt 



J2 ^ik^Zk] 



(6) 



k 



for k = 1 




impute a counting model aimed at conducting Bayesian inference with the 
CME. 

2.2. Synthetic data 

In order to generate synthetic trajectories of the stochastic SIR model 
in exact accordance with the CME a simulation method was outlined in 
Section [2| namely the Gillespie simulation algorithm [31] . For a moderate 
number of reactions, the Gillespie algorithm is feasible. A sub sampling of 
the resulting trajectories will generate data exactly distributed as the CME 
model. We will use this procedure to simulate synthetic data in Section |4] 
However, the moment approximations in (|2])-([7]) are used to impute a model 
for the data and this is used to perform our inferences. This in fact avoids the 
testing strategy known as "inverse crime" , see |13] , where the same theoretical 
ingredients are used to synthesize and to invert data (infer parameters) in an 
inverse problem. We explain this inference procedure in the next Section. 

3. Bayesian inference and MCMC 

Recently, various Bayesian and likelihood based approaches to parameter 
inference in CME type models have been attempted. Some have tried to work 
with a deterministic, continuous or steady state alternative model to infer 
the parameters in the CME [2T1 122] , but with modest success; specially in the 
most important scenarios with low number of counts for the species, when the 
stochastic evolution of the system is in fact the main object of interest [231 E2]- 
Note that, in the artificial extreme case when observations are available for 
all reactions in all species inference is straightforward since the problem 
becomes one of estimating the rates in exponential sampling. If in addition 
it is assumed that the reaction rates have the form /ifc(X, 6k) = OkQkidk) even 
a simple (Gamma) Bayesian conjugate inference can be conducted [32] . 
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Indeed, data is never complete in real CME applications and the inference 
problem could be regarded as one of missing data [S]. [32] takes the latter 
perspective to develop a full Bayesian approach. A consistent handling of 
missing data is available in the Bayesian paradigm were a complete data set 
is simulated from the predictive distribution of missing data given the actual 
observations and, in turn, the unknown parameters are simulated form the 
complete data posterior (the analytic version of this is totally out of the 
question in this context). This creates a two stage sampling procedure that 
would generate samples from the correct posterior distribution. The problem 
in this context is simulating complete data sets given observations at fixed 
time points of typically few (or just one) of the species. This would mean 
generalizing the Gillespie simulation algorithm to force all trajectories to pass 
through the count observations available for each species. As opposed to the 
simplicity of the original Gillespie algorithm, this additional complication 
renders the missing data simulation very complex and apparently no direct 
simulation method is available. [52] attempt two algorithms to approach 
this missing data simulation creating a rather complex MCMC. Even in the 
case of a very simple two species three reaction CME they obtain partial 
success when considering some prior settings and it is indeed not clear how 
this MCMC will generalize to other different or more complex CME's [32] . 

Since simulating data is possible, the likelihood free ABC inference ideas 
may be used [2S]. However, as mentioned above, the ABC lacks of a firm 
theoretical support and many crucial details remain unknown, namely, what 
regularity conditions are required (if any) for the ABC to work, how to 
choose the tolerance and metric, and the consequences of not using a sufficient 
statistic [2H]- Moreover, when the number of reactions per unit of time 
increases the Gillespie algorithm may become very inefficient. 

Here we take a novel approach to performing Bayesian inference for pa- 
rameters in the CME, different in many respects to the previous approaches 
outlined above. We use the moment approximations in (j6])-([7]) to impute 
a counting data model for available observations matching those moments, 
to create an imputed likelihood to perform our inference. That is, instead 
of considering the full Bayesian missing problem approach, we build an ap- 
proximate data model from which inference is much simplified. Nonetheless, 
our examples show that inference and trajectory prediction in the true CME 
is possible when using our approximate likelihood. Although the Kramer- 
Moyal moment approximation would provide similar results as ([6|)-([7|) [see 
l33l chapter 7], the van Kampen approximation outlined in Section pThas a 
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firmer theoretical background, and will apply for rate functions common in 
'chemical' reactions (as those considered here). Furthermore, the only crit- 
ical assumption is that the system (the total number of species counts) is 
large [33], which is precisely the case in epidemic models. That is, individ- 
ual counts can indeed be low at specific time points, but the system size fl 
remains large (say > 100), which in closed population epidemic models (like 
the SIR model) remains constant and equal to the population size. There 
is also the moment closure approach [311 SS] for which a Bayesian inference 
has also been attempted [3lJ, but assumes a specific moment structure and 
is neither suited to handle low species counts. 

3.1. Approximate Likelihood 

We assume the difficult case in which only one species is observed. In 
most applications of the SIR model explained in Section |2] only the number 
of Infectious individuals at some fixed time points are observed (in fact, in 
some real situation only an approximation for the latter is available, see 



Section 3.3). That is, let ti,t2, ■ ■ ■ ,tn be some observation time points and 
D = {x2{ti),X2{t2), ■ ■ ■ ,X2{tn)) the data available. 

We define a Generic Discrete Distribution Gd{m,v), that is a combina- 
tion of the commonly used counting data models, namely, the Binomial, the 
Poisson and the Negative-Binomial, and is explained in detail in [16]. For 
any mean and variance fi,v > we make a combination of these three dis- 
tributions in the following way 

e-^^i; if fi = v (8) 

where x G N, » = 1 — mini — , — |, m = and are the combina- 

tions of m items taken in subsets of size x. That is, we use a Binomial if 
fi > V, a. Poisson ii fi = v and a Negative-Binomial if /i < f . Neither of these 
distributions can handle any mean and variance; by combining these distri- 
butions we obtain the Generic Discrete class Gd{fi, v) defined for arbitrary 
mean fi and variance v, n,v > 0, and these two moments completely define 
the distribution. Indeed, it is straightforward to see that if X ~ Gd{fi,v), 
E{X) = /i and V{X) = v. More importantly, for a fixed mean /i, given 
both the properties of the Binomial and the Negative-Binomial, we see that 
\imy^^Gd(x\ii,v) = Therefore we have a continuous evolution of this 
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parametric class, being the Poisson the "continuous bridge" between the Bi- 
nomial (/i > v) and Negative Binomial (yU < v). (Note that if fi > v and 
f — /i, the support will increase to cover all N since m — )■ oo.) Moreover, if 
X ~ Gd{fi,v), will tend to a standard Normal distribution if /i — )■ oo 
and p ^ po E (0,1). That is, for large /i, large counts, (and for example 
/i — 3y/v > 0) Gd{^, v) can be approximated with a N{fi, v). 

From we assume that E[X2(t) |6o5 ^1) ^] = i^t and V[X2(t)|6o5 ^ii ^] = 

Vt are readily available be numerically solving the referenced system of dif- 
ferential equations. We assume that 

X2{U)\hoM,^-Gd{mu,vu) (9) 

(If observations for more species were available, a multivariate generic dis- 
crete distribution is also explained in (author?) |16] which can be used to 
define their joint distribution using the cross moments in ([7])). 

Certainly, observations are taken at time points and since we are dealing 
with a Markov process these are not independent in general. The time lagged 
cross moments E[X2(t)X2(t + /i)] would be very useful in defining the discrete 
joint distribution for observations, but these moments cannot be calculated 
in any way similar to the moments in ([6])-([7]). Note, however, that the model 
in ^ is only an approximation, imputed to match the available moments 
and, as approximations go, we may as well take their joint distribution as 

n 

l{B;bo,h,n) = l[Gd{x2{U)\mt^,VtJ. 

i=l 

Consequently our approximate likelihood is defined as the product of individ- 
ual distributions, as if the observations were independent. We present in the 
next section the corresponding prior and posterior distributions. The use and 
validity of this approximate likelihood will only be pondered when we show 
to recover the true parameter values in a synthetic data scenario. Section |4| 
and prove its predictive capability in real data examples, in Section [5j 

3.2. Posterior distribution, MCMC and prediction 

Since b^, bi and Q are positive parameters, we use a Gamma distribution 
as a default and flexible family of priors for these parameters, namely 

bo Ga{ao, (3q), bi Ga{ai, (3i) and fl Ga{a2, (32) , (10) 
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for known hyper-parameters ai, /3i. The log-posterior distribution is therefore 

n 

log f {bo, h, = C + 5^ log Gd{x2{U)\mt^, VtJ 

i=l 

1 

+ -l)log(6,) 

j=0 

+ (a2 - 1) log(fi) - f32n, 

for some unknown normalizing constant C, bQ,bi,Q > 0. In some situations 
the total population size fl is known and the posterior distribution will de- 
pend on the parameters bo and bi only by fixing the (log) posterior to Q = N, 
for some fixed population size N. 

The likelihood depends on the parameters bo,bi,Q through the moments 
mt-,Vt-. For any fixed parameter settings, evaluating the likelihood therefore 
represents numerically solving the differential equations in (|6])-([7]). Indeed, 
no analytical alternative is available and therefore we resort to MCMC sim- 
ulation techniques [47j. 

Since no analytical version is available for the posterior distribution, the 
full conditional distributions cannot be calculated and usual approaches like 
the Gibbs sampler are impossible to implement. Conventional Metropolis- 
Hastings MCMC, like a random walk MH [17], are difficult to design and 
use. Even a fine tuned MCMC for a particular example may fail radically 
with the simplest modification. We use instead a self adjusted MCMC that 
automatically tunes itself and has been proved to be of great use in a variety 
of situations. Namely, we use the t-walk |1H] MCMC sampler (for continuous 
parameters) which only requires the log posterior distribution and two initial 
points for the parameters of interest. In many examples, those shown below 
and many others, the t-walk works quite well in this two or three parameter 
MCMC simulation problem. 

It is very interesting and useful to predict future observations of species; 
specially in the case of the epidemic example presented in Section [5| where 
the ability to predict the evolution of the epidemic in the near future may 
well be a crucial piece of information for public health decision makers. In 
Bayesian statistics the marginal posterior distribution of observables is cal- 
culated to make predictions (i.e.. the posterior predictive distribution). In 
this approximate likelihood setting and once the (t-walk) MCMC sample 
for our parameters is available, b'^o\^^i\^'^^\^ — 1)2,..., if, simulating 
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posterior predictive samples for X2(tn+i) is straightforward by simulating 
from G(i(m^^|^, That is, for each of the specific parameter settings 
b'^\ b['^\ ^l''^^ the moment approximations are also calculated at the future 
time tn+i and a (predictive) simulated value x''2'\tn+i) is generated by simu- 
lating from the Generic Discrete distribution, G'(i(m|^|^, explained in 
Section 13.11 

3.3. Handling surveillance epidemic data 

Our aim is to use epidemics surveillance data to make our inferences. 
According to Diekmann and Heesterbeek [39] prevalence is the number of 
cases notified up to a given time t, and time is measured in some arbitrary 
scale. On the other hand, the incidence is the expected value of the number 
of new cases per unit time. In order to circumvent the difficulty of using 
surveillance data to estimate the model parameters we have assumed that 
data is proportional to incidence. 

We assume that the actual number of infectious individuals at reporting 
times satisfies X2{ti) = kr{ti) where k is an unknown constant and r{ti) are 
the number of reported cases at time ti. That is, the number of infectious 
people is proportional to the number of reported cases. Moreover, the ini- 
tial number of susceptibles xi{0) = Q — X2{0) is also unknown. Since k is 
unknown, it is confounded with fl. We simply take Q as an unknown (now 
instrumental) parameter in our model and take X2{ti) = r{ti). This prag- 
matic approach avoids further complicating the model and inference and in 
our synthetic data examples permits reconstructing bo and bi even when Q 
is also unknown. 

In the examples that we study below cases are reported monthly or weekly 
and are confirmed through an official detection algorithm approved by the 
Federal Health Secretariat of Mexico. In this respect the Cuernavaca data 
is probably the one better monitored since information on each case is more 
complete (age, sex, address, reference numbers, dates of admission to health- 
care, etc are provided). Nonetheless, both examples are only meant to test 
our methods and therefore show probably two typical cases of the extremes 
of data quality and availability. In a forthcoming paper we will be more 
concerned with the actual population dynamics of the disease where a more 
detailed data selection and treatment is necessary. 
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4. Example: Synthetic Data 

In this section we offer synthetic examples aiming at showing the accuracy 
and predictive features of the inference methods developed in the previous 
sections. For the sake of conciseness, we restrict ourselves to two contrasting 
cases. First, we set Q = 200 and use many data points to solve the inference 
problem, see Figure [T] In the second example Q is regarded as unknown 
(also a parameter) and many data points are trimmed out of the analysis. 
Furthermore, some data points are predicted, see Figure |2} In both examples 
synthetic data was created with the Gillespie algorithm mentioned above and 
fixed parameter values to 60 = 40 and 61 = 7. 

By using the formalism of the CME we are allowing for stochastic fluc- 
tuations in the state variables that, in the case of the SIR model, refer to 
demographic stochasticity due to individual differences in contact rates (60) 
and time of infectiousness (61). In Figure [l](d) we present the posterior dis- 
tribution for Ro which, therefore, approximates the distribution of Ro among 
individuals in the population. The Maximum a posteriori (MAP, the pos- 
terior mode) estimator for Rq is 6.0, with (4.8, 7.0) as Highest Posterior 
Density 95% probability interval [HPD, an interval accumulating 0.95 prob- 
ability with minimum length, the Bayesian equivalent of 'credible' intervals, 
see ED! • Model predictions follow the true epidemic behavior with reasonable 
accuracy (see Figure [T]^a)). In this experiment the full data set was used. 

To test for robustness we performed a new run but trimming away the 
last 14 weeks of the epidemic and considered the total population size Q, 
as unknown. Essentially we are only using the information necessary for 
the estimation of Rq. Model predictions are presented in Figure [2]^a). The 
capacity of the model to predict the epidemic tail is remarkable although, 
as expected, the uncertainty on the estimation of Ro increases relative to 
the previous case, moreover, its posterior distribution is quite skewed, see 
Figure [2|^d). However, although in this case the MAP estimator for Rq is 2.9, 
its posterior mean is 6.3, quite similar to the estimator of Rq in the complete 
data, known Q, previous example. The 95% HPD probability interval is 
(2.9, 16.2). Note that for this skew distribution a single estimator is a very 
bias representation of the posterior distribution. Unlike all other posterior 
densities presented here the MAP and posterior mean differ in this case. The 
posterior distribution itself should always be used for a better interpretation 
in this Bayesian analysis [a point that went unnoticed in |5T] ■ 
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l.O 



1.0 



l.O 



(a) 




30 35 40 45 50 55 
Infectious contacts/time 



(b) 




6.5 7.0 7,5 8.0 8,5 9,0 



(c) 




3 4 5 6 7 



(d) 



Figure 1: Synthetic data with 60 = 40, bi = 7 and N = 200. (a) True model 
(blue) and MAP estimated model (green) and 5% and 95% quantile bands 
for the imputed Gd distribution at any time. Posterior distributions for 60 
(b), bi (c) and Rq (d). 
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70 25 30 35 40 45 50 55 
Infectious contacts/time 




10 15 20 25 30 



(b) 



(c) 




10 20 30 40 50 60 70 



(d) 



Figure 2: Synthetic data with = 40, hi = 7 and = 200, 14 observations 
trimmed away from the end. (a) True model (blue) and MAP estimated 
model (green), 5% and 95% quantile bands and box-plots for the posterior 
predictive distribution at two future points (actual data in *). Posterior 
distributions for (b), hi (c) and Rq (d). 
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Table 1: Reported data on the basic reproduction number for Dengue from 
several sources. Adapted from Hsieh and Chen [51] 



Source 


Rq or range 


Confidence interval 95% 


Hsieh and Ma 155J 


2.23 


(1.47,3.00) 


Hsieh and Chen [54] 


3.93-4.67 




Koopman et al [56] 


1.33 - 2.40 




Marques el al [57] 


1.6 - 2.4 




Khoa et al [58] 


1.25 - 1.75 




Chowell et al pj 


3.09 


(2.34 - 3.84) 


Chowell et al [52j 


2.0 


(1.75-2.23) 


Chowell et al [59j 


1.76 




Massad et al [6ClJ 


1.9 





5. Example: Dengue Fever Outbreaks 

There are numerous studies on the estimation of the basic reproductive 
number for Dengue. Many of these works look at the local initiation of the 
outbreak to estimate directly the basic reproductive number (e.g., Chowell, 
Diaz-Duenas, Miller et al, [S^; Mendes-Luz, Torres-Codeco, Massad et al, 
[S3] ) or use a phenomenological approach to approximate its upper bound 
(Hsieh and Chen, [SI]; Hsieh and Ma, [SS]). Here we are able to estimate 
the basic reproduction number and the fate of the epidemic for the Dengue 
outbreaks in Acapulco, Mexico, in 2005 and Cuernavaca, Morelos in 2008 
using a full differential equations model as explained earlier. 

In Table [Tj we list some estimates for the reproductive number for Dengue 
with their corresponding confidence intervals (when available) and for differ- 
ent cities. The reader is referred to the cited reference for further details. We 
have omitted those references whose estimations show very large variability. 

In the 2005 and 2008 epidemic outbreaks recorded for the cities of Aca- 
pulco, state of Guerrero, and Cuernavaca, state of Morelos, Mexico, re- 
spectively only one viral strain was identified (no co-circulation of various 
serotypes). In 2005 DEN-1 and DEN-2 in Guerrero was isolated (Carrillo- 
Valenzo, Danis-Lozano, Velasco-Hernandez et al, [61] and data from Di- 
reccion General de Epidemiologia) and in 2008 in Morelos it was DEN-2 
(Direccion General de Epidemiologia) 

By using the SIR model we are neglecting the role of the population 
dynamics of the mosquito during the epidemic outbreak and are, therefore. 
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incorporating its effects as part of the constant contact rate bi. Our ratio- 
nale behind this assumption is the following: during the epidemic period the 
density of mosquito increases substantially. Smith, Dushoff and McKenzie 
[62] have shown that during an epidemic outbreak the human biting rate 
is highest shortly after mosquito density peaks and that the proportion of 
infected mosquitoes peaks when mosquito population density is declining. 
On the other hand Sanchez, Vanlerberghe, Alfonso et al [63] show data that 
confirms that mosquito abundance is strongly correlated with an epidemic. 
There have been reports for Rio de Janeiro, however (Honorio, Nogueira, 
Codeco et al, [M]), that have shown that the time of highest case incidence 
is not associated with a local increase of vector abundance, observation that 
can be caused by the movement patterns of the population. In this work we 
are looking at a geographical scale that comprises large metropolitan areas 
where the environment for transmission, during an epidemic, can be thought 
of as saturated with mosquito although at a local level (that of city district, 
block or house), it might not be the case. A high mosquito density also im- 
plies a high turnover rate of mosquito generations that result in a constant 
availability of female adult mosquitoes for disease transmission. This obser- 
vations justify the treatment of a Dengue epidemic outbreak as if it were the 
epidemic outbreak in a directly transmitted disease. Of course, this hypoth- 
esis breaks down in the interepidemic period also known as the "endemic" 
state where low prevalence, asymptomatic infections and vector abundance 
are subject to stochastic environmental effects as well as to the seasonal forc- 
ing induced by the rainy season on the vector population. Consequently, we 
limit our estimation to the length of time of the epidemic outbreak in each 
one of the cases analyzed. 

In summary we use the susceptible, infectious, recovered (SIR) Ker- 
mack McKendrick model as an approximation for the Dengue dynamics (e.g., 
Adams, Holmes, Zhang et al, [65j) for only one epidemic period. We therefore 
disregard demographic influences since births and deaths may be considered 
to have negligible effects on disease dynamics during this short time period 
and assume the mosquito population dynamics is constant and plays a role 
as a factor in the contact rate. 

We rewrite now model ([2])-(|4]) with the standard epidemiological notation 
where bo is the contact rate, I/7 is the length of the infectious period and 
N = Hi + 1/2 + ys is the total population density: 
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h ^2 




= -&o^2/i 






2/2 


= h,-y, - 


2/3 


= 72/2- 



(11) 

72/2 (12) 

(13) 

The basic reproduction number for this system is 

-Ko — — , 

7 

where we have rescaled the total population to be = 1. For comparison 
the simplest vector transmitted disease based on the Ross-Macdonald model 
renders a reproductive number of the form 

abm 

where a and b are the mosquito and human biting rates, m is the ratio of 
female mosquitoes to human hosts and 1/6 and I/7' are the infectiousness 
periods for mosquito and human respectively. 

It is obvious that the infection rate in the SIR model, I/7, includes the 
total time available per capita for infection in both mosquito and human 
hosts; also the contact rate 60 in the Kermack-McKendrick model aggregates 
the combined effect of the product ab in R^. Therefore the estimates of 
Rq that we are about to obtain by applying our methods, will not discrim- 
inate the different components of this threshold parameter. Our methods, 
however, can be applied directly to more complicated models that explicitly 
incorporate the relevant parameters for the human and vector populations if 
necessary. 

5.1. Acapulco 2005 and Cuernavaca 2008 outbreaks 



Here we apply our methods to the two data sets described in section (3.3 ). 
For completeness we provide now a brief description of the basic life-cycle 
facts of the Dengue virus. The endemic-epidemic cycle of Dengue has the 
following characteristics. After an infected human host is bitten the viral 
incubation period within the mosquito lasts about 7-14 days after which the 
mosquito becomes infectious for its entire lifespan which is at most 25 days. 
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Figure 3: Inverse Gamma prior distribution for pi = I/7 = 1/bi for our 
Dengue outbreaks examples. 



In the human host the incubation period last form 3-14 days (with an av- 
erage of 7). The infectious period lasts for about 2-10 days after outset of 
symptoms. Values on biting rates are not available and these are the main 
components that are estimated or circumvented when dealing with the basic 
reproduction number (see Chowell, Diaz-Duefias, Miller et al, [52] and Hsieh 
and Chen, ^Mj)- Since we do not have information for p = I/60 we use a uni- 
form prior over a very long range (not shown). The prior information for the 
infectious period pi = 1/bi is shown in Figure |3} Recall that pi = I/7 is ag- 
gregating the infectious period of both the human host and vector discounted 
by the time allocated to viral incubation. To see this consider the following 
rough approximation: take the average human infectious period of 7, and 
the mosquito infectious period of about 10 days. The total time available 
for transmission is 7 x 10 = 70; the proportion of the mosquito and human 
lifespan available to infection given the extrinsic and intrinsic incubation pe- 
riods is the weighting factor that should be incorporated into I/7. Suppose 
an extreme scenario where all the mosquito lifespan is infectious and that 
the human intrinsic incubation period is 10 days; then roughly the propor- 
tion of time available for closing a transmission cycle human-vector-human 
is 1 X 1/10 = 0.1 which when multiplied by 70 gives the mode assigned to the 
prior for pi = I/7 shown in Figure |3} For this inverse gamma distribution 
for pi = 1/7 = 1/bi the shape parameter is set to 9 and the scale parameter 
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Figure 4: (a) Daily reported cases of Dengue fever in Cuernavaca, Mexico, 
2008, with several observations trimmed away at the end. Estimated MAP 
model and three future fortnights of reported cases predicted (box-plots of 
predictive distribution), actual data with plotted with *. (b) Posterior dis- 
tribution for 6o (c) Posterior (histogram) and prior distribution for pi = 1/bi 
and (d) posterior distribution of Rq = 6o/7- The approximated posterior 
densities with the full data are also presented (black). 



is 8 (9-1). Its mean is therefore 8 (days; these are in fact equal to ai and 
Pi, given the shape and scale parametrization used for the Gamma prior for 
6i = 7 in 

For the Cuernavaca outbreak the MAP estimator of the reproductive 
number Rq is 1.9 with (1.8,2.2) as HPD 95% probability interval, when Au- 
gust to December 2008 are not considered in the estimation. The prediction 
of the fate of the epidemic is rather impressive, capturing basically all the 
data variability (see Figure |4]). Intentionally, we chopped off data at the end 
of a descending fluctuation run, after the epidemic maximum, in early August 
2008. The descending run did not fool our predictions of the future evolution 
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of the epidemic (see the predictive distribution Box-Plots in Figure |4](a)). At 
that point, reasonable estimates for bo and bi were already available and 
good predictions could be done, taking into account all modeled sources of 
variability. 

For the Acapulco outbreak the MAP estimator of the reproductive num- 
ber Rq is 1.3 with (1.1,1.6) as HPD 95% probability interval when data 
starting in November 2005 are not considered in the estimation. The predic- 
tion of the fate of the epidemic is quite good considering the few data points 
used in the analysis (see Figure [5]). 

The estimate for Acapulco is in the lower range of values reported in 
Table [!} In fact, only the lower bound for the range in Khoa et al [SH] 
is comparable in magnitude. The estimate for Cuernavaca is located more 
within the range reported by several studies. 

We remark that the scarce (monthly) data from Acapulco leads to less 
informative predictions, hiding the intrinsic stochasticity of the state vari- 
ables. In contrast data from Cuernavaca are reported weekly. See Figure |4] 
There are two possible explanations for this trend. Either, there might be 
two waves of Dengue, one starting in early April and other developing in 
early August 2008. The SIR model is unable to capture this dynamics in 
the local detail (late July-early August) but it is able to capture the tail of 
the epidemic outbreak reasonably well (see Figure |4^). If this hypothesis of 
two Dengue waves was true, the second wave of the epidemic corresponds 
may correspond to a different set of parameters, that increases the incidence 
immediately starting in August. On the other hand, the decreasing trend in 
August may be explained by demographic stochasticity alone. 

6. Conclusions 

With respect to this approach as an useful inferential and predictive 
framework for epidemic surveillance data, Breban, Vardavas and Blower [66] 
present a critique to estimating Rq using population level models. These au- 
thors claim that using an individual-level model approach to estimate Rq is 
more accurate and may not coincide with the Rq estimated from a population 
level model. In the approach we present here, the incorporation of the CME 
formalism introduces demographic stochasticity, i.e., individual stochastic 
variation into the epidemic parameters bo and bi thus actually approximat- 
ing individual level characteristics. The methods shown here do generate 
distributions of the basic parameters, including Rq based on the a priori 
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Figure 5: Reported cases of Dengue fever in Acapulco, Mexico, 2005, with 6 
observations trimmed away at the end. Estimated MAP model and three fu- 
ture months of reported cases predicted (box-plots of predictive distribution), 
actual data plotted with *. (b) Posterior (histogram) and prior distribution 
for po = l/^o- (c) Posterior (histogram) and prior distribution for pi = 1/bi 
and (d) posterior distribution of Rq = 6o/7- 



22 



information available. To reinforce our point the results by Lloyd-Smith, 
Schreiber, Kopp and Getz [67] show that Rq has an highly skewed distri- 
bution in many diseases when sampled from individuals through contact- 
tracing. Certainly our results Figures ^ and ([s]) do show (admittedly only 
slightly) skewed distributions that indicates the individual variability in con- 
tact and recovery rates at the individual level. This variability does not come 
from sampling errors but from an intrinsic property of stochastic individual 
variation constructed into the model via the CME. Our estimation approach 
uses this built-in demographic stochasticity to estimate Rq. We are using 
a very simple model, the SIR without vital dynamics and this may explain 
the relatively low variability reflected in our nearly symmetric distributions: 
the model is smoothing out individual differences. We are using the simplest 
priors that are biologically reasonable. 

In general, we expect to generate highly skewed distributions for at least 
one of our parameters when, for example, few data points are available and 
the uncertainty in estimating bi is such that its posterior distribution borders 
on zero, the distribution of Rq = bo/bi will result heavily skew with a long 
tail (as in Figure [2]^d)). 

Indeed, the SIR model presented here is quite simple, and specially ap- 
plied to (vector induced, delayed, etc.) Dengue. Nevertheless, we obtained 
interesting inferences and prediction power from limited data, in real and 
synthetic scenarios, and the model simplicity servers us to present and ex- 
periment with our inference procedure. Our inference framework can indeed 
be generalized to be used in a more complex model (as explained in Sec- 
tion |2]), were for example the vector- human interaction is modeled in a more 
comprehensive way. We leave this for further research. 
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